{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 3: 应用开发基础\n",
    "\n",
    "此 notebook 介绍了`bigdl-llm`的基本用法，并带您逐步构建一个最基础的聊天应用程序。\n",
    "\n",
    "## 3.1 安装 `bigdl-llm`\n",
    "\n",
    "如果您尚未安装bigdl-llm，请按照以下示例进行安装。这一行命令将安装最新版本的`bigdl-llm`以及所有常见LLM应用程序开发所需的依赖项。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --pre --upgrade bigdl-llm[all]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 加载预训练模型\n",
    "\n",
    "在使用LLM之前，您首先需要加载一个模型。这里我们以一个相对较小的LLM作为示例，即[open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2)。\n",
    "\n",
    "> **注意**\n",
    "> * `open_llama_3b_v2`是一个基于LLaMA结构的开源大语言模型，您可以在该模型在Huggingface上托管的[主页](https://huggingface.co/openlm-research/open_llama_3b_v2)上获取更多模型相关信息。\n",
    "\n",
    "### 3.2.1 加载和优化模型\n",
    "\n",
    "通常情况下，您只需要一行`optimize_model`就可以轻松优化已加载的任何PyTorch模型，无论采用的是什么库或者API。关于`optimize_model`更详细的用法，请参考[API文档](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html#).\n",
    "\n",
    "此外，大量流行的开源PyTorch大语言模型都可以使用`Huggingface Transformers API`（例如[AutoModel](https://huggingface.co/docs/transformers/v4.33.2/en/model_doc/auto#transformers.AutoModel), [AutoModelForCasualLM](https://huggingface.co/docs/transformers/v4.33.2/en/model_doc/auto#transformers.AutoModelForCausalLM) 等）来加载。对于这类模型，`bigdl-llm`也提供了一套API来支持。我们接下来展示一下这种API的用法。\n",
    "\n",
    "在这个例子里，我们将使用`bigdl.llm.transformers.AutoModelForCausalLM`来加载`open_llama_3b_v2`。这个API相对官方的`tranformers.AutoModelForCasualLM`，除了增加了一些低比特优化相关的参数和方法，其他部分在使用上完全一致。\n",
    "\n",
    "要应用INT4优化，只需在`from_pretrained`中指定`load_in_4bit=True`即可。另外根据经验，我们默认设置参数`torch_dtype=\"auto\"`和`low_cpu_mem_usage=True`，这会有利于性能和内存优化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.llm.transformers import AutoModelForCausalLM\n",
    "\n",
    "model_path = 'openlm-research/open_llama_3b_v2'\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(model_path,\n",
    "                                             load_in_4bit=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **注意**\n",
    "> * 如果您需要使用除了INT4以外的其他精度（例如INT3/INT5/INT8)，或者想了解更多API的详细参数，请参阅[API 文档](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html#)。\n",
    "\n",
    "> * `openlm-research/open_llama_3b_v2`是`open_llama_3b_v2`模型在huggingface上托管的model_id。如果将`from_pretrained`的`model_path`参数设置成model_id，`from_pretrained`会默认从huggingface上下载模型、缓存到本地路径（比如 `~/.cache/huggingface`）并加载。下载模型的过程可能会较久，您也可以自行下载模型，再将`model_path`变量修改为本地路径。关于`from_pretrained`的用法，请参考[这里](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained)。\n",
    "\n",
    "\n",
    "### 3.2.2 保存和加载优化后的模型\n",
    "\n",
    "在上一节中，用`Huggingface transformers` API加载的原模型通常是以fp32或fp16精度存储的。为了节省模型存储空间并加速后续加载过程，`bigdl-llm`还提供了`save_low_bit`接口用于保存低比特优化后的模型，以及`load_low_bit`接口用于加载已保存的优化模型。\n",
    "\n",
    "由于`load_low_bit`不需要读取原始的模型，也省去了优化模型的时间，通常我们可以做一次`save_low_bit`操作，然后将模型部署在不同平台上用`load_low_bit`加载并进行多次推理。这种方法既节省了内存，又提高了加载速度。而且，由于优化后的模型格式与平台无关，您可以在各种不同操作系统的计算机上无缝执行保存和加载操作。这种灵活性使您可以在内存更大的服务器上进行优化和保存操作，然后在有限内存的入门级个人电脑上部署模型进行推理应用。\n",
    "\n",
    "**保存优化后模型**\n",
    "\n",
    "例如，您可以使用`save_low_bit`函数来保存优化后模型，如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_directory = './open-llama-3b-v2-bigdl-llm-INT4'\n",
    "\n",
    "model.save_low_bit(save_directory)\n",
    "del(model)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**加载优化后模型**\n",
    "\n",
    "您可以使用`load_low_bit`函数加载优化后的模型，如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers\n",
    "model = AutoModelForCausalLM.load_low_bit(save_directory)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 构建一个最简单的聊天应用\n",
    "\n",
    "现在，模型已经成功加载，可以开始构建我们的第一个聊天应用程序了。接下来将使用`Huggingface transformers`推理API来完成这个任务。\n",
    "\n",
    "\n",
    "> **注意**\n",
    "> \n",
    "> 本节中的代码完全使用`Huggingface transformers` API实现。`bigdl-llm`不需要在推理代码中进行任何更改，因此您可以在推理阶段使用任何库来构建您的应用程序。\n",
    "\n",
    "\n",
    "> **注意**\n",
    "> \n",
    "> 我们使用了Q&A的对话式模板，以更好地回答问题。\n",
    "\n",
    "\n",
    "> **注意**\n",
    "> \n",
    "> 您在调用`generate`函数时，可以通过修改`max_new_tokens`参数来指定要预测的tokens数目上限。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import LlamaTokenizer\n",
    "\n",
    "tokenizer = LlamaTokenizer.from_pretrained(model_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inference time: xxxx s\n",
      "-------------------- Output --------------------\n",
      "Q: What is CPU?\n",
      "A: CPU stands for Central Processing Unit. It is the brain of the computer.\n",
      "Q: What is RAM?\n",
      "A: RAM stands for Random Access Memory.\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "with torch.inference_mode():\n",
    "    prompt = 'Q: What is CPU?\\nA:'\n",
    "    \n",
    "    # tokenize the input prompt from string to token ids\n",
    "    input_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n",
    "    # predict the next tokens (maximum 32) based on the input token ids\n",
    "    output = model.generate(input_ids, max_new_tokens=32)\n",
    "    # decode the predicted token ids to output string\n",
    "    output_str = tokenizer.decode(output[0], skip_special_tokens=True)\n",
    "\n",
    "    print('-'*20, 'Output', '-'*20)\n",
    "    print(output_str)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
